
    The MEANING Project

    Progress is being made in Natural Language Processing (NLP), but we are still a long way from Natural Language Understanding. An important step towards this goal is the development of technologies and resources that deal with concepts rather than words. However, to be able to build the next generation of intelligent open-domain Human Language Technology (HLT) application systems we need to solve two complementary and intermediate tasks: Word Sense Disambiguation (WSD) and automatic large-scale enrichment of Lexical Knowledge Bases. The MEANING Project is funded by the EU 5th Framework IST Programme.
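    As a minimal illustration of working with concepts rather than word forms, the sketch below disambiguates a word against WordNet senses with NLTK's simplified Lesk implementation and then walks the concept hierarchy. It is a generic knowledge-based baseline, not the MEANING project's own methods, and it assumes the NLTK WordNet data has been downloaded.

    # Hedged sketch: knowledge-based WSD with simplified Lesk over WordNet.
    # Illustrative baseline only, not the MEANING project's algorithms.
    # Assumes: pip install nltk  and  nltk.download("wordnet")
    from nltk.corpus import wordnet as wn
    from nltk.wsd import lesk

    context = "I went to the bank to deposit my salary".split()
    print(len(wn.synsets("bank", "n")), "noun senses of 'bank' in WordNet")

    # Pick the WordNet synset (concept) whose gloss overlaps most with the context.
    sense = lesk(context, "bank", pos="n")
    print(sense, "-", sense.definition())

    # Once text is mapped onto synsets, enrichment can operate on concepts:
    for hypernym in sense.hypernyms():
        print("hypernym:", hypernym.name())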

    Towards zero-shot cross-lingual named entity disambiguation

    In cross-lingual Named Entity Disambiguation (XNED) the task is to link Named Entity mentions in text in some native language to English entities in a knowledge graph. XNED systems usually require training data for each native language, limiting their application to low-resource languages with small amounts of training data. Prior work has proposed so-called zero-shot transfer systems which are trained only on English data, but these require native prior probabilities of entities with respect to mentions, which have to be estimated from native training examples, limiting their practical interest. In this work we present a zero-shot XNED architecture where, instead of a single disambiguation model, we have a model for each possible mention string, thus eliminating the need for native prior probabilities. Our system improves over prior work on XNED datasets in Spanish and Chinese by 32 and 27 points respectively, and matches the systems which do require native prior information. We experiment with different multilingual transfer strategies, showing that better results are obtained with a purpose-built multilingual pre-training method than with state-of-the-art generic multilingual models such as XLM-R. We also discovered, surprisingly, that English is not necessarily the most effective zero-shot training language for XNED into English. For instance, Spanish is more effective when training a zero-shot XNED system that disambiguates Basque mentions with respect to an English knowledge graph. This work has been partially funded by the Basque Government (IXA excellence research group (IT1343-19) and DeepText project), project BigKnowledge (Ayudas Fundacion BBVA a equipos de investigacion cientifica 2018) and the IARPA BETTER Program contract 2019-19051600006 (ODNI, IARPA activity). Ander Barrena enjoys a post-doctoral grant ESPDOC18/101 from the UPV/EHU and also acknowledges the support of the NVIDIA Corporation with the donation of a Titan V GPU used for this research. The author thankfully acknowledges the computer resources at CTE-Power9 + V100 and technical support provided by the Barcelona Supercomputing Center (RES-IM-2020-1-0020).
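    As a rough sketch of the cross-lingual linking setup, one can rank the English knowledge-graph candidates for a native-language mention by encoding the mention's context and the candidates' English descriptions with a multilingual encoder and comparing them by cosine similarity. This is only an illustrative baseline, not the per-mention architecture or the purpose-built pre-training described above, and the model checkpoint named below is an assumption.

    # Hedged sketch: rank English KG candidates for a Spanish mention by
    # multilingual semantic similarity. Illustrative only; the paper's system
    # uses per-mention models and purpose-built multilingual pre-training.
    # Assumes: pip install sentence-transformers (model name is an assumption).
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Native-language context containing the mention "Amazonas".
    context = "El Amazonas atraviesa Brasil y desemboca en el Atlántico."

    # English candidate entities and their descriptions from the knowledge graph.
    candidates = {
        "Amazon River": "Major river in South America, the largest by discharge.",
        "Amazon (company)": "American multinational technology and e-commerce company.",
        "Amazons": "Warrior women in Greek mythology.",
    }

    ctx_emb = encoder.encode(context, convert_to_tensor=True)
    cand_embs = encoder.encode(list(candidates.values()), convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, cand_embs)[0]

    best = max(zip(candidates, scores), key=lambda kv: float(kv[1]))
    print("predicted entity:", best[0])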

    Hitzen adiera-desanbiguazioa

    Our language is ambiguous. A word has several interpretations depending on the context in which it appears, and determining which sense it takes is not an easy task, even though we do it naturally. Assigning the correct sense to occurrences of a word by computational means is called word sense disambiguation (WSD). Automatic WSD relies on knowledge, and that knowledge can be obtained from several sources, ranging from sense-tagged corpora to ontologies. Unfortunately, the process of building these resources is expensive, a problem known as the knowledge acquisition bottleneck. Once these obstacles are overcome and WSD technology reaches maturity, the way we access information will change completely, opening the door to the Semantic Web. Automatic WSD is also helpful for Natural Language Processing tools, for example in Machine Translation, since it can help overcome both polysemy and synonymy problems automatically. In this article we give a general introduction to word sense disambiguation and, in particular, we focus on methods based on sense-tagged corpora. We show that word sense disambiguation is a hard task, because it has to cope with the complexity of language and recover semantic structure from plain text.
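    Since the article focuses on methods learned from sense-tagged corpora, here is a minimal supervised WSD sketch: a bag-of-words classifier that picks the sense of an ambiguous word from its context. The training examples and sense labels are invented for illustration and stand in for a real sense-tagged corpus.

    # Toy supervised WSD sketch: learn the senses of "bank" from sense-tagged
    # examples. Corpus-based illustration only; the data below is invented.
    # Assumes: pip install scikit-learn
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_contexts = [
        "deposited money at the bank before noon",
        "the bank approved the mortgage loan",
        "we fished from the muddy bank of the river",
        "trees grow along the bank of the stream",
    ]
    train_senses = ["bank%finance", "bank%finance", "bank%river", "bank%river"]

    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_contexts, train_senses)

    # Likely predictions: finance sense for the first context, river for the second.
    print(clf.predict(["she opened an account at the bank"]))
    print(clf.predict(["the boat drifted towards the bank of the river"]))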

    Traducción Automática Neuronal no Supervisada, un nuevo paradigma basado solo en textos monolingües

    This article presents UnsupNMT, a 3-year project of which the first year has already been completed. UnsupNMT proposes a radically different approach to machine translation: unsupervised translation, that is, translation based on monolingual data alone with no need for bilingual resources. The method is based on deep learning of temporal sequences combined with cutting-edge interlingual word representations in the form of cross-lingual word embeddings. The project is not only a highly innovative proposal but also opens a new paradigm in machine translation which branches out to other disciplines, such as transfer learning. Despite the current limitations of unsupervised machine translation, the techniques developed are expected to have great repercussions in areas where machine translation achieves worse results, such as translation between language pairs with little contact, e.g. German and Russian. UnsupNMT is a project funded by the Spanish Ministry of Economy, Industry and Competitiveness (TIN2017-91692-EXP).
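    The cross-lingual word embeddings mentioned above are obtained by mapping independently trained monolingual embedding spaces into a shared space. The sketch below shows a simplified supervised variant of such a mapping, the orthogonal Procrustes solution computed from a seed dictionary; the fully unsupervised method the project builds on dispenses with the seed dictionary, and the vectors here are random toy data rather than real word embeddings.

    # Hedged sketch: map a "source-language" embedding space onto a "target-language"
    # space with an orthogonal (Procrustes) transform learned from seed word pairs.
    # Simplified, supervised variant for illustration; UnsupNMT relies on fully
    # unsupervised cross-lingual mappings. Toy random vectors stand in for real
    # word2vec/fastText embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_seed = 50, 200

    # Y: target-language vectors; X: source-language vectors for a seed dictionary,
    # simulated here as a rotated, slightly noisy copy of Y.
    Y = rng.normal(size=(n_seed, dim))
    true_rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    X = Y @ true_rotation + 0.01 * rng.normal(size=(n_seed, dim))

    # Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W orthogonal.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    mapped = X @ W
    cosines = np.sum(mapped * Y, axis=1) / (
        np.linalg.norm(mapped, axis=1) * np.linalg.norm(Y, axis=1))
    print("mean cosine after mapping:", cosines.mean())   # close to 1.0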

    Euskarako ezagutza-base lexiko-semantikoaren eredu-hautaketa eta garapena: EuskalWordNet

    Natural Language Processing techniques require lexical-semantic knowledge bases (LSKBs) in order to perform semantic interpretation. For this reason, the IXA group decided to develop a Basque LSKB called EuskalWordNet. EuskalWordNet is based on WordNet and its multilingual counterparts, EuroWordNet and the Multilingual Central Repository (MCR). This paper reviews the theoretical and practical aspects of the EuskalWordNet LSKB, as well as the steps followed in its construction.
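    As context for what a wordnet-style LSKB offers, the sketch below queries synsets, relations and multilingual lemmas through NLTK. This is not the EuskalWordNet/MCR distribution itself, and the availability of Basque ("eus") lemmas in NLTK's Open Multilingual Wordnet download is an assumption.

    # Hedged sketch of wordnet-style LSKB queries via NLTK. This is *not*
    # EuskalWordNet/MCR itself; it only illustrates the synset model they share.
    # Assumes: pip install nltk, nltk.download("wordnet"), nltk.download("omw-1.4"),
    # and that the Open Multilingual Wordnet tables include Basque ("eus").
    from nltk.corpus import wordnet as wn

    synset = wn.synset("house.n.01")      # a concept node in the LSKB
    print(synset.definition())            # gloss of the concept
    print(synset.hypernyms())             # semantic relations between concepts

    # Lemmas in other languages attached to the same interlingual concept.
    print(synset.lemma_names("spa"))      # Spanish lemmas
    print(synset.lemma_names("eus"))      # Basque lemmas (assumed available)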

    Image captioning for effective use of language models in knowledge-based visual question answering

    Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks. Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is based upon work partially supported by the Ministry of Science and Innovation of the Spanish Government (DeepKnowledge project PID2021-127777OB-C21), and the Basque Government (IXA excellence research group IT1570-22).
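    A rough sketch of the caption-then-ask pipeline described above: verbalize the image with an off-the-shelf captioner and hand the caption plus the question to a text-only language model. The Hugging Face checkpoints named below are illustrative assumptions, not the captioner or language models evaluated in the paper.

    # Hedged sketch: verbalize an image with an off-the-shelf captioner, then let a
    # text-only language model answer the question from the caption alone.
    # Checkpoints are illustrative assumptions, not the paper's models.
    # Assumes: pip install transformers torch pillow
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    reader = pipeline("text2text-generation", model="google/flan-t5-base")

    image_path = "example.jpg"   # placeholder: any local image file
    caption = captioner(image_path)[0]["generated_text"]

    question = "What material is the object in the picture usually made of?"
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    print(reader(prompt, max_new_tokens=10)[0]["generated_text"])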

    Traducción automática basada en tectogramática para inglés-español e inglés-euskara

    We present the first attempt to build machine translation systems for the English-Spanish and English-Basque language pairs following the tectogrammar approach. Based on the existing English-Czech system, we describe the language-specific tools added in the analysis and synthesis steps, and the resources for bilingual transfer. Evaluation shows the potential of these systems to be adapted to new languages and domains. The research leading to these results has received funding from FP7-ICT-2013-10-610516 (QTLeap project, qtleap.eu).
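    To make the analysis-transfer-synthesis idea concrete, here is a toy sketch of transfer at a deep-syntactic layer: nodes carry lemmas plus grammatical attributes, only the lemmas are replaced through a bilingual lexicon, and a trivial synthesizer re-inflects the output. It is an invented miniature for illustration, not the Treex-based English-Spanish and English-Basque systems described above.

    # Toy transfer-based MT sketch in the spirit of tectogrammar: analyze into
    # deep-syntactic nodes (lemma + attributes), transfer lemmas with a bilingual
    # lexicon, then synthesize surface forms. Invented miniature for illustration;
    # the actual systems are built on the Treex/TectoMT pipeline.
    from dataclasses import dataclass

    @dataclass
    class TNode:                 # deep-syntactic (t-layer) node
        lemma: str
        pos: str
        number: str = "sg"

    # "Analysis": a hand-built t-tree for "the dogs bark" (toy, no real parser).
    t_tree = [TNode("dog", "noun", number="pl"), TNode("bark", "verb", number="pl")]

    # "Transfer": replace lemmas via a toy English->Spanish lexicon; attributes stay.
    lexicon = {"dog": "perro", "bark": "ladrar"}
    for node in t_tree:
        node.lemma = lexicon.get(node.lemma, node.lemma)

    # "Synthesis": trivial Spanish inflection rules covering this example only.
    def synthesize(node: TNode) -> str:
        if node.pos == "noun":
            return "los " + node.lemma + ("s" if node.number == "pl" else "")
        if node.pos == "verb" and node.lemma.endswith("ar"):
            return node.lemma[:-2] + ("an" if node.number == "pl" else "a")
        return node.lemma

    print(" ".join(synthesize(n) for n in t_tree))   # -> "los perros ladran"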